CASi: A framework for cross-timepoint analysis of single-cell RNA sequencing data

Single-cell RNA sequencing (scRNA-seq) technology has been widely used to study the differences in gene expression at the single cell level, providing insights into the research of cell development, differentiation, and functional heterogeneity. Various pipelines and workflows of scRNA-seq analysis have been developed but few considered multi-timepoint data specifically. In this study, we develop CASi, a comprehensive framework for analyzing multiple timepoints’ scRNA-seq data, which provides users with: (1) cross-timepoint cell annotation, (2) detection of potentially novel cell types emerged over time, (3) visualization of cell population evolution, and (4) identification of temporal differentially expressed genes (tDEGs). Through comprehensive simulation studies and applications to a real multi-timepoint single cell dataset, we demonstrate the robust and favorable performance of the proposal versus existing methods serving similar purposes.


Artificial neural networks (ANNs)
Denote the scRNA-seq expression matrix of the training data by X0 where X0 is a p by n0 matrix with p being the total number of measured genes and n0 is the number of cells, and the corresponding training cell label, Y 0, is a n0 by 1 vector.Similarly, denote the expression matrix of the testing data by X1, which has dimensions p by n1 and the labels by Y 1.After the standard min-max normalization, we select the top 2000 most variable genes from X0 and the same set of features from X1. Next, using Keras (Chollet, 2016), an open-source software library that provides a Python interface for artificial neural networks, we train a neural network model with one input layer, one output layer, and three hidden layers: will be estimated during the training process.And z l for l = {1, 2, 3} are the hidden neurons with corresponding weight W 1, and bias β 1 .σ(•) is the activation function, which can be a sigmoid, a rectified linear unit (ReLU), a hyperbolic tangent, etc.We choose to use the ReLU function in our hidden layers because the neural networks based on an ReLU function are generally easier to train and can avoid the vanishing gradient problem during optimization (Eckle and Schmidt-Hieber, 2019); it is mathematically expressed as σ ReLU (x) = max(x, 0).The SoftMax function will be used in the output layer.This is because the number of output categories is more than two, and it converts the values of the output layer into the predicted probabilities of each label.The number of neurons in the three hidden layers is selected as {256, 128, 64}.The model is trained using a stochastic gradient descent (SGD)-based algorithm with the mean squared error loss function L (Pr(y 0 ), Y0) = Pr(y 0 ) − Y0 2 .We use Adam as the optimization algorithm (Kingma and Ba, 2014), and the minibatch training strategy (Li et al., 2014), which randomly trains a small proportion of samples and validates the rest of the samples in each iteration to improve training efficiency.By monitoring the loss, we implement the early stopping rule in Keras to avoid overfitting.Once the model performance stops improving for a couple epochs, the training process will stop.Additionally, to further prevent the overfitting issue, we add a dropout step with the dropping rate of 0.4 for each hidden layer to randomly drop units from the neural network during training.A one-time training process of ANN is shown in Fig. 1.

Three scenarios
The simulation data are generated by assuming three sampling time points: t0, t1, t2.To fully evaluate our method, we designed three scenarios: 1) t0, t1, and t2 data contain the same cell types but with different cell type compositions; 2) a cell type in t0 disappears in t1 and t2 data, i.e., t0 data have one more cell type than t1 and t2 data; 3) a new cell type appears in t1 and t2 data, i.e., t1 and t2 data have one more cell type than t0 data.We obtain a publicly available dataset of peripheral blood mononuclear cells (PBMC) (Zheng et al., 2017), containing more than 60,000 sorted cells from eight immune cell types.We randomly extract cells from five cell types and use different cell type compositions for different scenarios.Detailed cell compositions are presented in Table 1.500 400 300 500 400 300 500 400 300 cytotoxic T 500 400 300 500 400 300 500 400 300

Generate tDEGs
To evaluate the performance of identifying tDEGs, we conduct extensive simulations by manually creating tDEGs and calculating the true discovery rate.Considering that different types of cells will have very different gene expression profiles, we only extract monocyte cells from the PBMC data to do simulation.We draw 2000 genes and 900 monocyte cells.Among 2000 genes, 300 genes are randomly chosen to be tDEGs.First, we add group effect.Among 900 cells, half of the cells are assigned to be responders by multiplying the baseline gene expression by U nif (1.5, 2), i.e., a uniform distribution of min = 1.5 and max = 2, while the other half of the cells are assigned to be non-responders and their baseline gene expressions stay the same.Next, we divided the whole data into three time points (t0, t1, and t2); namely we will have 150 responder cells and 150 non-responder cells at each time point.We manually assign 300 genes to be the tDEGs by adding the time effect.We design three levels of time effect: weak, medium, and strong.For the weak time effect scenario, the baseline gene expression of t0, t1, and t2 cells are multiplied by U nif (0.8, 1), U nif (1, 1.2), and U nif (1.2, 1.4), respectively.For the medium time effect scenario, the baseline gene expression of t0, t1, and t2 cells are multiplied by U nif (0.4, 0.6), U nif (1, 1.2), and U nif (1.6, 1.8), respectively.For the strong time effect scenario, the baseline gene expression of t0, t1, and t2 cells are multiplied by U nif (0.4, 0.6), U nif (1.2, 1.4), and U nif (2, 2.2), respectively.

CASi avoids the overclustering issue
The first step of CASi uses the neural network classifier to achieve cross-time points cell annotation with high accuracy.And as a supervised learning method, it efficiently avoids the overclustering issue.Using the same scenario settings of the simulation, we compare ARI (adjusted rand index) of unsupervised clustering implemented in the Seurat package with ARI of CASi using supervised clustering.The results of three scenarios are shown in Fig. 11.It can be observed that our method's ARI increases with the cell number increasing, while the Seurat ARI decreases with the cell number increasing.This indicates that an overclustering issue does exist in unsupervised clustering methods, such as Seurat, and supervised clustering methods, such as CASi, are able to avoid this issue.

Fig. 4 :
Fig. 4: The UMAP plot of t2 data, with the purple cluster representing the novel cell type and the grey cluster representing existed cell types.

Fig. 5 :
Fig. 5: The UMAP plot of t2 data, showing the correlation levels between existing cell types in t0 data and new cell types in t2 data.

Fig. 6 :
Fig. 6: The scmap fish plot of scenario 3 in which a novel cell type appears.We use B cells as the novel cell type which shows in green.And scmap fails to capture any novel cell.

Fig. 9 :
Fig.9: MCL real-world data: the hallmark pathway results of enrichment analysis using 500 temporal differentially expressed genes identified by CASi.

Fig. 10 :
Fig.10: MCL real-world data: the reactome pathway results of enrichment analysis using 500 temporal differentially expressed genes identified by CASi.

Fig. 11 :
Fig. 11: Adjusted rand index (ARI) results of simulation study using different total cell numbers based on 200 repetitions.

Table 1 .
Simulation: cell type compositions